INSTRUCTIONS:

  1. Give your answer below each numbered task, starting on a new line.
  2. The tasks are clustered under level 2 headers (##). Do one set at a time.
  3. Preview your notebook at the end of each in-class exercise task set. Does the notebook look correctly formatted? If not, fix it. In most cases the issue is

KEYBOARD SHORTCUTS:

  1. Assignment operator (<-) is Alt+- for Windows and Option+- for Mac
  2. Insert new code chunk Ctrl+Alt+I for Windows and Command+Option+I for Mac
  3. Run a line of code Ctrl+Enter for Windows and Command+Enter for Mac
  4. Run all the code within a chunk Ctrl+Shift+Enter for Windows and Command+Shift+Enter for Mac
  5. Insert a pipe operator (%>%) Ctrl+Shift+M for Windows and Command+Shift+M for Mac

Class 3 Tasks

Filter rows (Task time 15 mins)

  1. Select the first three columns (year, month, day), all other columns whose names contain the term “delay” (use contains()), and the origin column. Filter this data to show all the flights that took off in the morning (before 12:00) from JFK in December. Make sure to use a pipe between the select and filter commands. Refer to the shortcut for inserting a pipe (see above).
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.1  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
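flights %>% 
  # contains("delay") matches dep_delay and arr_delay, as the task asks;
  # dep_time is kept as well so the filter below can test for morning departures
  select(1:3, contains("delay"), origin, dep_time) %>% 
  filter(month == 12 & origin == "JFK" & dep_time < 1200)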
  2. Use top_n() to print the most delayed departures from NYC in 2013. Read the documentation for top_n() on the tidyverse site if you are unsure how it works.
flights %>% 
  top_n(n = 5, wt = dep_delay)
  # returns the five longest departure delays
  # use = (not ==) here because we are assigning values to the arguments n and wt,
  # not testing whether two values are equal
  3. Run the code below, read the error and fix the code so that it works.
flights %>% 
    filter(month == 6 & day > 15)
  # the original code had month = 6; it needs ==
  # we are not setting an argument value here, we are testing a condition on each row
  4. Filter the flights that were between the 10th and 40th most delayed in terms of arrival (arr_delay) using the dense_rank() helper function.
flights %>% 
  filter(dense_rank(desc(arr_delay)) %in% 10:40)
  # %in% is a matching operator, see notes
  # dense_rank() gives tied values the same rank and leaves no gaps between ranks,
  # so ties are kept and the result can contain more than 31 rows
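
How dense_rank() treats ties is easiest to see on a tiny example. The sketch below uses a made-up five-row tibble (the `delays` data and the `dense_r` / `min_r` column names are hypothetical, not part of the flights data) and compares dense_rank() with min_rank().

```r
library(dplyr)

# Hypothetical toy delays with one tie (30 appears twice)
delays <- tibble(id = 1:5, arr_delay = c(50, 30, 30, 20, 10))

ranked <- delays %>%
  mutate(
    dense_r = dense_rank(desc(arr_delay)),  # ties share a rank, no gaps afterwards
    min_r   = min_rank(desc(arr_delay))     # ties share a rank, a gap follows
  )

ranked

# Keeping dense ranks 2 to 3 returns three rows, not two,
# because both tied flights hold rank 2
ranked %>% filter(dense_r %in% 2:3)
```

The same effect applies to the flights answer above: a rank window of 10:40 spans 31 ranks, but each rank can hold several flights.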

Mutate a tibble (Task time 10 mins)

  1. Create a variable that indicates whether a flight took off in the AM or the PM.
flights %>% 
  mutate(takeoff = if_else(dep_time < 1200, "Flight is AM", "Flight is PM"))
  2. Use transmute() instead of mutate() to do the same. What is the difference between the two?
flights %>% 
  transmute(takeoff = if_else(dep_time < 1200, "Flight is AM", "Flight is PM"))
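
One way to see the difference is to compare the column names each verb returns. This is a minimal sketch on a made-up three-row tibble; `toy` and the helper `am_pm()` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical three-flight tibble standing in for flights
toy <- tibble(
  flight   = c(101, 102, 103),
  dep_time = c(545, 1330, NA)   # NA stands for a flight with no recorded departure
)

# Helper used only in this sketch: label a departure time as AM or PM
am_pm <- function(t) if_else(t < 1200, "AM", "PM")

with_mutate    <- toy %>% mutate(takeoff = am_pm(dep_time))
with_transmute <- toy %>% transmute(takeoff = am_pm(dep_time))

names(with_mutate)     # original columns plus takeoff
names(with_transmute)  # takeoff only
```

mutate() adds the new variable to the existing columns, while transmute() keeps only the variables it creates. In both cases a missing dep_time gives a missing takeoff label.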

Grouped operations (Task time 30 mins)

  1. Find the top 10 airline carriers that had the highest average departure delays in 2013 using group_by(), summarise() and other functions you have learnt previously.
  2. Use group_by() with mutate() to create a new variable called comparativeDelay, which is the difference between a flight's departure delay and the average delay at its origin airport for that hour in 2013 (check out the time_hour variable in the flights data). Store the result in a variable called comparativeDelays.
  3. Now group the comparativeDelays tibble by carrier to print the top 10 airlines with the greatest average comparative delays.
  4. Use group_by() with filter() to print the 5 most delayed flights from each origin. Your printed tibble should have 15 rows.
  5. The air authority in NY wants to penalize carriers for departure delays. Which of the three metrics (task 1, 3 or 4) would you recommend they use to identify the airlines to penalize? Why?
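
The three grouped patterns these tasks rely on (grouped summarise, grouped mutate, grouped filter) can be sketched on a small made-up tibble before applying them to flights. The `toy` data and the result names below are hypothetical; this is an illustration of the pattern under those assumptions, not a model answer.

```r
library(dplyr)

# Hypothetical toy data: two carriers, two origins, a single hour
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  origin    = c("JFK", "JFK", "JFK", "LGA"),
  time_hour = c(1, 1, 1, 1),
  dep_delay = c(10, 30, 20, 5)
)

# Grouped summarise: one row per carrier, largest average delay first
avg_by_carrier <- toy %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))

# Grouped mutate: every row is kept, and the mean is computed
# within each origin-and-hour group
comparative <- toy %>%
  group_by(origin, time_hour) %>%
  mutate(comparativeDelay = dep_delay - mean(dep_delay, na.rm = TRUE)) %>%
  ungroup()

# Grouped filter: the ranking restarts within each origin,
# so this keeps the single most delayed flight per origin
worst_per_origin <- toy %>%
  group_by(origin) %>%
  filter(min_rank(desc(dep_delay)) <= 1) %>%
  ungroup()

avg_by_carrier
comparative
worst_per_origin
```

On the real data the rank cutoff and the number of carriers shown would change, but the grouping logic stays the same.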

ggplot

  1. Make a scatterplot of your choice using any two numeric variables from the flights dataset.
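
One possible scatterplot, as a minimal sketch: the choice of dep_delay against arr_delay, and the restriction to a single day to limit overplotting, are assumptions made for illustration. The object names `jan1` and `p` are hypothetical.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

# Assumed choice: flights on 1 January 2013 with both delays recorded,
# which keeps the number of points manageable
jan1 <- flights %>%
  filter(month == 1, day == 1, !is.na(dep_delay), !is.na(arr_delay))

# Departure delay on the x axis, arrival delay on the y axis;
# alpha makes overlapping points easier to see
p <- ggplot(jan1, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.4) +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)")

p
```

Any other pair of numeric columns (for example distance and air_time) can be swapped into aes() in the same way.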